Evidence-based, data-based thinking
The study of the entire economy of a region, a country, or the entire world
Open social science database.
Data Science and Artificial Intelligence
Computer Resources
Data
Big Data
Open and Public Data
Q. Explain ‘Open Data’ and ‘Public Data’, and their differences. AI: Poe Assistant.
A. AI Poe Assistant. (Poe.com)
“Open data” and “public data” are related concepts that refer to different aspects of data accessibility and usage.
Open Data: Open data refers to the idea that certain data should be freely available to the public, without restrictions on access, usage, or redistribution. Open data is typically released in a machine-readable format, such as CSV, JSON, or XML, and is accompanied by an open license that allows anyone to use, reuse, and distribute the data for any purpose. The goal of open data is to promote transparency, collaboration, and innovation by enabling individuals, organizations, and communities to analyze and build upon the data.
Open data often includes government data, scientific research data, public records, and other datasets that are deemed to be of public interest. Governments, international organizations, and various institutions actively promote the release of open data to foster accountability, drive economic growth, and empower citizens to make informed decisions.
Public Data: Public data, on the other hand, refers to any data that is accessible or available to the public. It is a broader term that encompasses both open data and other types of data that may have certain restrictions on access or usage. Public data can include information that is freely and openly available, as well as data that is accessible through specific channels, such as public libraries, government agencies, or subscription-based services.
Public data can include a wide range of information, such as government reports, census data, legal documents, financial statements, and more. While public data may be accessible to the public, it does not necessarily guarantee the same level of openness, usability, or freedom as open data. Public data may have limitations on its usage, copyright restrictions, or conditions imposed by the data provider.
In summary, open data specifically refers to data that is freely accessible, usable, and redistributable without restrictions, often accompanied by an open license. Public data, on the other hand, is a broader term that includes both open data and other data that may have varying degrees of accessibility and usage restrictions.
Visit Open Government Data Toolkit.
Posit.cloud: https://posit.cloud/
Shared Project: https://posit.cloud/content/5539763
Create Your Copy and Download it.
Step 1. Install packages if necessary.
install.packages("tidyverse")
install.packages("WDI")
Step 2. Load packages.
library(tidyverse)
library(WDI)
Step 3. Create a data directory (first time only).
dir.create("data")
Step 4. Set the system language to English (recommended).
Sys.setenv(LANG = "en")
The following code chunk downloads GDP data using the indicator code below.
WDI indicator: NY.GDP.MKTP.PP.KD
df_gdp <- WDI(indicator = "NY.GDP.MKTP.PP.KD")
N.B. WDI contains many GDP-related indicators.
To avoid repeated downloads, save the data and reuse it.
CSV: comma-separated values, a plain-text format for tabular data.
write_csv(df_gdp, "data/gdp.csv")
Run the code above only once to download the data and write it into the data directory.
df_gdp <- read_csv("data/gdp.csv")
Rows: 16758 Columns: 5
── Column specification ───────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, iso2c, iso3c
dbl (2): year, NY.GDP.MKTP.PP.KD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
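The download-once advice above can be sketched as a small helper. This is only an illustration: read_cached() and fake_download() are hypothetical names that are not part of WDI or readr, and the values are made up so that the sketch runs without an internet connection.

```r
library(readr)

# Hypothetical helper (not part of WDI or readr): if the CSV is missing,
# call `download` once and save the result; then always read the saved file.
read_cached <- function(path, download) {
  if (!file.exists(path)) {
    dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
    write_csv(download(), path)
  }
  read_csv(path, show_col_types = FALSE)
}

# Stand-in for WDI(indicator = "NY.GDP.MKTP.PP.KD"); the values are made up.
n_calls <- 0
fake_download <- function() {
  n_calls <<- n_calls + 1
  data.frame(country = c("A", "B"), year = c(2020, 2020), gdp = c(1.5e12, 2.0e11))
}

cache_file <- file.path(tempdir(), "data", "gdp_demo.csv")
unlink(cache_file)  # start from a clean state for the demonstration

df_first  <- read_cached(cache_file, fake_download)  # downloads and writes
df_second <- read_cached(cache_file, fake_download)  # reads the saved file only
n_calls   # the download function ran once
```

In the real workflow the second argument would be something like function() WDI(indicator = "NY.GDP.MKTP.PP.KD"), assuming the WDI package is loaded.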
head: print the first 6 rows by default
head(df_gdp)
2.561800e+12 is in scientific notation, i.e., \(2.561800 \times 10^{12} = 2,561,800,000,000\).
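A quick way to check such a conversion is to let R print the number in both notations. The sketch below uses only base R's format().

```r
# The value discussed above, written in scientific notation.
x <- 2.5618e12

# scientific = FALSE forces fixed notation; big.mark adds thousands separators.
fixed <- format(x, big.mark = ",", scientific = FALSE)
print(fixed)

# Going the other way: the same number in scientific notation.
sci <- format(x, scientific = TRUE)
print(sci)
```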
str: display the structure of an object
str(df_gdp)
spc_tbl_ [16,758 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
$ iso2c : chr [1:16758] "ZH" "ZH" "ZH" "ZH" ...
$ iso3c : chr [1:16758] "AFE" "AFE" "AFE" "AFE" ...
$ year : num [1:16758] 2022 2021 2020 2019 2018 ...
$ NY.GDP.MKTP.PP.KD: num [1:16758] 2.56e+12 2.47e+12 2.37e+12 2.43e+12 2.38e+12 ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. NY.GDP.MKTP.PP.KD = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary: display the summary of an object
summary(df_gdp)
In an R Notebook, the following also displays the first 1000 rows of the data in a paged format.
df_gdp
|> is called the pipe operator. The code below is the same as filter(df_gdp, country == COUNTRY).
filter: keep rows that match a condition
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY)
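The equivalence between the piped form and the ordinary function call can be checked on a tiny table. The rows in df_demo are made up for illustration and are not WDI values.

```r
library(dplyr)

# Made-up rows for illustration only.
df_demo <- tibble(
  country = c("Japan", "Japan", "World"),
  year    = c(2020, 2021, 2021),
  value   = c(1, 2, 3)
)

COUNTRY <- "Japan"

piped  <- df_demo |> filter(country == COUNTRY)  # pipe form
nested <- filter(df_demo, country == COUNTRY)    # ordinary call

identical(piped, nested)  # both forms return the same rows
```

The pipe passes its left-hand side as the first argument of the function on its right, which is why longer chains read from top to bottom.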
ggplot + geom_line: tidyverse functions for drawing a line graph
aes(year, NY.GDP.MKTP.PP.KD): aesthetic mapping that sends year to the x-axis and NY.GDP.MKTP.PP.KD to the y-axis
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Let’s remove the rows with missing values using drop_na(NY.GDP.MKTP.PP.KD). This is a data transformation.
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
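To see what drop_na() does when a column is named, here is a minimal sketch on a made-up data frame (the column names and values are assumptions for illustration).

```r
library(tidyr)

# Made-up data: one NA in `gdp`, one NA in `other`.
df_demo <- data.frame(
  year  = c(1989, 1990, 1991),
  gdp   = c(NA, 100, 110),
  other = c(1, NA, 3)
)

# Only rows where `gdp` is missing are dropped; the NA in `other` is kept.
by_column <- drop_na(df_demo, gdp)

# With no column named, a row is dropped if any column has an NA.
any_column <- drop_na(df_demo)

by_column
any_column
```

Naming the column keeps as many rows as possible, which matters when other variables in the same table have gaps.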
COUNTRY <- "World"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Observations and Questions
e.g., The GDP of the world has been increasing continuously since 1990.
By country names
COUNTRIES <- c("Japan", "China", "India", "United Kingdom", "United States", "Germany", "France")
df_gdp |> filter(country %in% COUNTRIES) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
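Selecting several countries uses %in% rather than ==. The sketch below, on made-up rows, shows the difference: %in% tests membership, while == compares element by element and recycles the shorter vector, so it can silently miss rows.

```r
library(dplyr)

# Made-up rows for illustration only.
df_demo <- tibble(
  country = c("Japan", "Brazil", "India", "Japan"),
  year    = c(2020, 2020, 2020, 2021)
)
COUNTRIES <- c("Japan", "India")

# One TRUE/FALSE per row: is this country in COUNTRIES?
df_demo$country %in% COUNTRIES

selected <- df_demo |> filter(country %in% COUNTRIES)  # 3 rows
wrong    <- df_demo |> filter(country == COUNTRIES)    # recycling: only 1 row
```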
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR")
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
What happens if you replace color = iso2c at the bottom of the code above with colour = iso2c, color = country, or col = country?
df_gdp |> distinct(country, iso2c)
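The same distinct() call can be combined with filter() to look up the code of a particular country. The table below is a small stand-in shaped like df_gdp; the iso2c values are ISO 3166-1 alpha-2 codes and the years are arbitrary.

```r
library(dplyr)

# A few rows shaped like df_gdp (stand-in data, not downloaded from WDI).
df_demo <- tibble(
  country = c("India", "India", "Indonesia", "United Kingdom", "United Kingdom"),
  iso2c   = c("IN", "IN", "ID", "GB", "GB"),
  year    = c(2020, 2021, 2021, 2020, 2021)
)

# One row per country/code pair for the countries of interest.
codes <- df_demo |>
  filter(country %in% c("India", "United Kingdom")) |>
  distinct(country, iso2c)

codes
```

Looking codes up this way avoids guessing: the code for India is IN (ID is Indonesia), and the code for the United Kingdom is GB.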
Set COUNTRIES and/or ISO2C to draw line graphs of GDP.
COUNTRIES <- c() # surround the country name with quotation marks, and use a comma as a separator
df_gdp |> filter(country %in% COUNTRIES) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ISO2C <- c() # surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
World Bank Home Page
Excel Files
API Search
WDIsearch(string = "gdp", field = "name")
WDIsearch(string = "NY.GDP.MKTP.PP.KD", field = "indicator")
GDP, PPP (constant 2017 international $): NY.GDP.MKTP.PP.KD
Population, total: SP.POP.TOTL
Calculate GDP per Capita
GDP, PPP (constant 2017 international $): PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD
Population, total: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL
df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL"), extra = TRUE)
write_csv(df_gdppcap, "data/gdppcap.csv")
df_gdppcap <- read_csv("data/gdppcap.csv")
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, gdp)) + geom_line()
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, pop)) + geom_line()
Write your observations.
df_gdppcap2 <- df_gdppcap |> drop_na(pop) |>
mutate(gdppcap = gdp/pop, .before = gdp)
df_gdppcap2
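The per-capita calculation can be tried on a small table first. The figures below are made up; only the column names gdp, pop and gdppcap follow the code above.

```r
library(dplyr)
library(tidyr)

# Made-up figures for illustration; `pop` is missing for country B.
df_demo <- tibble(
  country = c("A", "B", "C"),
  gdp     = c(1e12, 5e10, 3e11),
  pop     = c(5e7, NA, 1e7)
)

df_demo2 <- df_demo |>
  drop_na(pop) |>                              # a ratio needs a denominator
  mutate(gdppcap = gdp / pop, .before = gdp)   # new column placed before gdp

df_demo2
```

The .before argument only controls where the new column appears; it does not change the values.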
COUNTRY <- "World"
df_gdppcap2 |> filter(country == COUNTRY) |>
ggplot(aes(year, gdppcap)) + geom_line()
df_gdppcap_check <- WDI(indicator = c(PCAP = "NY.GDP.PCAP.PP.KD", gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL"), extra = TRUE) |>
drop_na(pop) |>
mutate(gdppcap = gdp/pop, .before = gdp)
write_csv(df_gdppcap_check, "data/gdppcap_check.csv")
df_gdppcap_check <- read_csv("data/gdppcap_check.csv")
Rows: 16665 Columns: 16
── Column specification ───────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (7): year, PCAP, gdppcap, gdp, pop, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df_gdppcap_check |> drop_na(gdppcap) |> mutate(near = near(PCAP, gdppcap)) |>
summarize(n = n(), sum(near))
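The check above uses near() instead of == because computed floating-point numbers can differ from the published ones by tiny rounding errors. A minimal sketch, with made-up numbers in df_demo:

```r
library(dplyr)

x <- sqrt(2)^2   # mathematically 2, but stored with a tiny rounding error

x == 2           # FALSE: exact comparison fails
near(x, 2)       # TRUE: equal within a small tolerance

# Counting matches the same way as the check above, on made-up numbers.
df_demo <- tibble(
  PCAP = c(20000, 30000, 45000),
  gdp  = c(1e12, 3e11, 9e10),
  pop  = c(5e7, 1e7, 2e6 + 1)
)

check <- df_demo |>
  mutate(gdppcap = gdp / pop, near = near(PCAP, gdppcap)) |>
  summarize(n = n(), n_near = sum(near))

check
```

The default tolerance of near() is very small, so a row counts as a match only when the two values agree up to rounding error; the third made-up row differs by about 0.02 and is not counted.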
Two useful questions.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram()
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram()
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram() + scale_x_log10()
Change the number of bins, e.g., geom_histogram(bins = 20).
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
ggplot(aes(gdp)) + geom_histogram(bins = 20) + scale_x_log10()
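Why the log scale helps can be seen from the ranges alone. In this sketch the values are made up and span four orders of magnitude, roughly as GDP figures across countries do.

```r
library(ggplot2)

# Made-up values spanning four orders of magnitude.
df_demo <- data.frame(gdp = c(1e9, 3e9, 1e10, 5e10, 1e11, 2e12, 1e13))

range(df_demo$gdp)                       # raw scale: the largest value dominates
log_range <- range(log10(df_demo$gdp))   # log scale: 9 to 13
log_range

# On a log10 axis the bins have equal width in log10 units,
# so small and large economies are both visible.
p <- ggplot(df_demo, aes(gdp)) +
  geom_histogram(bins = 5) +
  scale_x_log10()
p
```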
Create a similar histogram by using scale_x_log10() and adjusting the number of bins.
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_histogram() + scale_x_log10()
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
ggplot(aes(gdppcap)) + geom_boxplot() + scale_x_log10()
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdppcap) |>
filter(income != "Aggregates") |>
ggplot(aes(gdppcap, income, fill = income)) + geom_boxplot() + scale_x_log10() +
theme(legend.position = "none")
df_gdppcap2 |> filter(year == 2020) |> drop_na(gdp) |>
filter(income != "Aggregates") |>
ggplot(aes(gdp, region, fill = region)) + geom_boxplot() + scale_x_log10() +
theme(legend.position = "none")
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region)) + geom_point() +
scale_x_log10() + scale_y_log10()
install.packages("plotly")
library(plotly)
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(color = country, shape = region, pop, gdp)) + geom_point() +
scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
becomes difficult to discriminate; you have 7. Consider specifying shapes manually if
you must have them.
CO2 emissions (metric tons per capita): EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
CO2 emissions (metric tons per capita): Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring. ID: EN.ATM.CO2E.PC
GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_co2gdp <- WDI(indicator = c(co2pcap = "EN.ATM.CO2E.PC", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_co2gdp, "data/co2gdp.csv")
df_co2gdp <- read_csv("data/co2gdp.csv")
COUNTRY <- "World"
df_co2gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, co2pcap)) + geom_line()
df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point()
df_co2gdp |> filter(year == 2020) |> drop_na(co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_co2gdp |> filter(year == 2020) |>
drop_na(gdppcap, co2pcap) |>
ggplot(aes(gdppcap, co2pcap)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10() + scale_y_log10()
df_co2gdp |> lm(co2pcap~gdppcap, data = _) |> summary()
Call:
lm(formula = co2pcap ~ gdppcap, data = df_co2gdp)
Residuals:
Min 1Q Median 3Q Max
-15.7271 -0.9824 -0.5867 0.6631 27.3247
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.349e-01 4.635e-02 9.383 <2e-16 ***
gdppcap 2.357e-04 1.955e-06 120.573 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.875 on 6909 degrees of freedom
(9847 observations deleted due to missingness)
Multiple R-squared: 0.6779, Adjusted R-squared: 0.6778
F-statistic: 1.454e+04 on 1 and 6909 DF, p-value: < 2.2e-16
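The plot with geom_smooth() uses log scales on both axes, and ggplot2 applies the scale transformation before fitting, so the line drawn there is a fit of log10(co2pcap) on log10(gdppcap). The lm() summary shown here is fitted to the untransformed values for all years, so its coefficients describe a different line. A minimal sketch of the log-log fit, on synthetic data that follow an exact power law (not WDI values):

```r
# Synthetic data: co2pcap = 0.001 * gdppcap^0.8 exactly (an assumption
# chosen so that the answer is known in advance).
df_demo <- data.frame(gdppcap = c(1000, 5000, 10000, 50000, 100000))
df_demo$co2pcap <- 0.001 * df_demo$gdppcap^0.8

# Fit a straight line on the log10 scale of both variables.
fit_log <- lm(log10(co2pcap) ~ log10(gdppcap), data = df_demo)

# Intercept = log10(0.001), slope = the exponent of the power law.
coef(fit_log)
```

On the log-log scale the slope is read as an elasticity: a 1% increase in x goes with roughly a slope% increase in y. To apply the same idea to the real data, one could replace df_demo with the 2020 rows of df_co2gdp after drop_na(gdppcap, co2pcap).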
School enrollment, secondary (% gross): SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $): NY.GDP.PCAP.PP.KD
School enrollment, secondary (% gross): Gross enrollment ratio is the ratio of total enrollment, regardless of age, to the population of the age group that officially corresponds to the level of education shown. Secondary education completes the provision of basic education that began at the primary level, and aims at laying the foundations for lifelong learning and human development, by offering more subject- or skill-oriented instruction using more specialized teachers. ID: SE.SEC.ENRR
GDP per capita, PPP (constant 2017 international $): GDP per capita based on purchasing power parity (PPP). PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP at purchaser’s prices is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.PCAP.PP.KD
df_secgdp <- WDI(indicator = c(sec = "SE.SEC.ENRR", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_secgdp, "data/secgdp.csv")
df_secgdp <- read_csv("data/secgdp.csv")
COUNTRY <- "World"
df_secgdp |> filter(country == COUNTRY) |>
ggplot(aes(year, sec)) + geom_line()
df_secgdp |> filter(year == 2020) |> drop_na(sec) |>
ggplot(aes(gdppcap, sec)) + geom_point()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
scale_x_log10()
df_secgdp |> filter(year == 2020) |> drop_na(gdppcap, sec) |>
ggplot(aes(gdppcap, sec)) + geom_point() +
geom_smooth(method = "lm", formula = 'y~x', se = FALSE) +
scale_x_log10()
df_secgdp |> lm(sec~gdppcap, data = _) |> summary()
Call:
lm(formula = sec ~ gdppcap, data = df_secgdp)
Residuals:
Min 1Q Median 3Q Max
-120.077 -15.876 3.865 17.126 85.165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.678e+01 4.372e-01 129.88 <2e-16 ***
gdppcap 9.878e-04 1.682e-05 58.71 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 23.43 on 5366 degrees of freedom
(11390 observations deleted due to missingness)
Multiple R-squared: 0.3911, Adjusted R-squared: 0.391
F-statistic: 3447 on 1 and 5366 DF, p-value: < 2.2e-16
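The coefficients in the summary above can be read as a straight line, sec = intercept + slope × gdppcap. The worked arithmetic below copies the two estimates from the output; the GDP per capita value of 20,000 is an arbitrary example, and the result describes an association in the data, not a causal effect.

```r
# Coefficients copied from the summary output above.
intercept <- 5.678e+01
slope     <- 9.878e-04

# Change in fitted enrollment (percentage points) per 10,000 international dollars.
per_10k <- slope * 10000
per_10k

# Fitted enrollment at a GDP per capita of 20,000 international dollars.
fitted_20k <- intercept + slope * 20000
fitted_20k
```

Since the multiple R-squared is 0.3911, this line accounts for only about 39% of the variation in enrollment, so individual countries can lie far from the fitted value.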